Summary of out Data

Here is a quick summary of our data. We see how many times values occured in our data frame as well as what were the quartiles values for our data.

summary(sales_final)
##        sku        customer_name                ship_city    ship_state
##  FNG00015:10   CUSTOMER 1:58    TLAJOMULCO DE ZUNIGA:12   MX     :26  
##  FNG00014: 8   CUSTOMER 3: 9    CUAUTITLAN IZCALLI  :10   TB     :18  
##  FNG00030: 7   CUSTOMER 6: 9    VILLAHERMOSA        :10   JA     :15  
##  FNG00031: 7   CUSTOMER 2: 2    CHALCO              : 8   NL     :14  
##  FNG00033: 7   CUSTOMER 4: 2    CULIACAN            : 8   SI     : 8  
##  FNG00043: 7   CUSTOMER 5: 2    MONTERREY           : 8   CH     : 1  
##  (Other) :37   (Other)   : 1    (Other)             :27   (Other): 1  
##     zip_code  shipping_method package_24_shipment   cases_sold    
##  86280  :12   Delivery:83     Min.   :   175      Min.   :   252  
##  45679  : 8                   1st Qu.:  8898      1st Qu.: 12264  
##  56640  : 8                   Median : 20664      Median : 24840  
##  66550  : 8                   Mean   : 58039      Mean   : 47306  
##  80130  : 8                   3rd Qu.: 73386      3rd Qu.: 62118  
##  54769  : 6                   Max.   :326400      Max.   :213612  
##  (Other):33                                                       
##    avg_spend     
##  Min.   : 1.566  
##  1st Qu.: 7.142  
##  Median :12.391  
##  Mean   :10.950  
##  3rd Qu.:15.707  
##  Max.   :16.975  
## 

Visulization of Customers

Now lets take a much closer look and see in which cities customers are buying these water cases.

ggplot(sales_final, aes(x = customer_name, fill = ship_city))+
  geom_bar(color = 'black')+
  labs(x = 'Customer',  y= 'Where that customer is buying from', fill = 'City')

Which cities are in what states

JA
## [1] "LAJOMULCO DE ZUNIGA" "EL SALTO"
MX
## [1] "TULTITLAN"          "CHALCO"             "CUAUTITLAN IZCALLI"
## [4] "TEOLOYUCAN"         "RAMOS ARIZPE"
NL
## [1] "GUADALUPE"         "MONTERREY"         "ESCOBEDO"         
## [4] "CIENEGA DE FLORES"
SI
## [1] "CULIACAN"
TB
## [1] "VILLAHERMOSA TABASCO" "VILLAHERMOSA"

Visualization of Cases Sold

Here we show which city had the most sold cases. What was the breakdown of the shipments to each city. We can see which cities are the ones who are buying the most cases and which ones are buying the average amount of around 47306 cases. We can make cost efficient decisions in the next quarter so to maximize our orders with those cities that are buying the most cases.

This is the total amount of cases that were shipped to each city

print(cities)
## # A tibble: 16 x 2
##    ship_city            total_sold
##    <fct>                     <int>
##  1 CHALCO                   410328
##  2 CHIHUAHUA                 12096
##  3 CIENEGA DE FLORES          4032
##  4 CUAUTITLAN IZCALLI       930676
##  5 CULIACAN                 541140
##  6 EL SALTO                  12628
##  7 ESCOBEDO                 197821
##  8 GOMEZ PALACIO              4032
##  9 GUADALUPE                 60192
## 10 MONTERREY                670048
## 11 RAMOS ARIZPE                616
## 12 TEOLOYUCAN                76552
## 13 TLAJOMULCO DE ZUNIGA     501454
## 14 TULTITLAN                 37392
## 15 VILLAHERMOSA             177761
## 16 VILLAHERMOSA TABASCO     289657

A pie chart of the percentage cases bought by each city

pie

A histogram representation

Here we see the frequency of cases were bought by a specific city. We can see which cities are buying larger orders at a time or who is buying lower quantities.

ggplot(sales_final, aes(x = cases_sold, fill = ship_city))+
  geom_histogram(color = 'black')+
  labs(x = 'Cases Sold', fill = "City")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Showing our Data on a Map

Here we see that I divided up the information and plotted it on a map of Mexico the busiest place that the cases were shipped for Q1 of 2017. This is good to visualize where we can focus more of the shipments in the future as to be sure to maximize shipments and profits.

ggplot(mexico, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group), fill = "white",
               color = "gray40", size = .2) +
  geom_point(data = sales_final2, aes(x = long, y = lat, size = cases_sold, color = ship_city))+
  scale_size_continuous(range = c(8,15))+
  labs(color = 'City of Destination', size = "Amount of Cases Sold", title ='Comparing the volume of shipment to each city in Mexico')

Lets look at the Distribution of Shipment Cost to each City

Here we have a boxplot that shows us the median price of shipping across all the cities in Q1. We see what the maximum value and minimum value were for shipping to that respective city. The average cost of shipment was $10.95.

ggplot(sales_final, aes(x = reorder(ship_city, avg_spend, median), y = avg_spend))+
  geom_boxplot()+
  theme_bw()+
  theme(axis.text.x=element_text(angle=45,hjust=1))+
  labs(x = "City", y = "Shipment Cost per Case(in Dollars)")

Regression between Cost per Case to Chip, The Amount of Cases Shipped, and 24 Pack Cases

Here we are showing the average shipping price of shipping cases of water against how many cases where sold at that price point. We then want to see how many complete 24 packs of water of those cases sold compare against to the price of the shipment. Since different products have different bottle sizes and pack sizes, so “24-pack equivalent” cases are used as a scale for a comparison between different product types.

ggplot(sales_final, aes(x = avg_spend, y = cases_sold))+
  geom_point(aes(size = package_24_shipment), alpha = 1/3)+
  labs(x = 'freight cost', y = 'cases sold', size = '24-pack equivalents')

Now we want to smooth out this data to see what sort of relationships each state state has with regards to the cases sold, shipping cost, and 24 pack equivalents. We can determine what relationship our data has form this information. As cases start to get larger we see which states comsume the most of our product, while others start to fall off or comsume less. We can also see at what amount states start to increase their consumptions. This informations is valuable as it would lead to know how many cases to estimate to each state which can result in having a more efficient second quarter.

ggplot(sales_final, aes(x = avg_spend, y = cases_sold))+
  geom_point(aes(size = package_24_shipment), alpha = 1/3)+
  geom_smooth(aes(color = ship_state), se = FALSE) +
  labs(x = 'freight cost', y = 'cases sold', size = '24-pack equivalents')
## `geom_smooth()` using method = 'loess'

Forecasting future cases sold in the next quarter Q2

With the limited data that we have of cases sold in the Q1 we will create a simple cases sold forecast to see what we can expect in the next quarter as far as cases sold go. Here we will provide insight into the values that were used to make the best fitted stastical prediciton.

Alpha and Beta Values

Since we are working with predictions we need to have a fitted alpha and beta that will fit our data by having the lowest sum of squared errors of prediction so that we know our data is fitted to the best values and will yield more accurate results.

alpha
## [1] 0.1727234
beta
## [1] 0.1482043

Forecast percentage growth over the next 90 days for Q2 of cases sold

This is what the predicted percent we can see in the growth of cases sold in Q2. We can see the predicted percentage growth over 30, 60, 90 days into the future.

growth_over_30
## [1] "4.62%"
growth_over_60
## [1] "9.4%"
growth_over_90
## [1] "14.2%"

The graphical visualization of our predicted growth of cases sold for Q2

Finally here is a graphical representation of out cases sold forecasting in Q2. We have out blue line as out predicted forecast and our upper and lower bounds with the red lines. When we have the actual data we can see how close the predicted data was to the actual data and calculate the residual of the values.

plot.ts(DT$V1,  main = "Cases Sold Predictions for Q2", xlab = "Days", ylab = 'Cases Sold', xlim = c(10,185), ylim = c(0, 200000))
lines(cases_sold.mean$fitted[,1], col = 'green') #the fitted is for all the fittted values, the 1 is for the first column
lines(cases_sold.pred[,1], col='blue') # the fit columns
lines(cases_sold.pred[,2], col = 'red') #upper column
lines(cases_sold.pred[,3], col = 'red') #lower columns

Conclusion

With this data we can now make accurate estimations for quarter 2 of sales and freight. We can estimate how much it would cost to ship to these same locations in quarter 2 as well as know about how much each respective state and city will consume in the future. The data also tells us what the perfect amount of cases would be for each state as to now have them start to consume less product; therefore, making our shipment process much more efficient as we will know how much a specific state/city will expect to consume.